library(ggplot2)
library(dtplyr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
houseprice_dataset <- read.csv("~/Desktop/Applied Statistics/houseprice_dataset.txt", sep=";")
d <- read.csv("~/Desktop/Applied Statistics/houseprice_dataset.txt", sep=";")
summary(d)
## n.rooms build.year postalcode square_meters
## Min. : 1.000 Min. :1900 Length:1011 Min. : 56.0
## 1st Qu.: 2.000 1st Qu.:1960 Class :character 1st Qu.: 98.3
## Median : 3.000 Median :1991 Mode :character Median :110.3
## Mean : 3.247 Mean :1982 Mean :112.5
## 3rd Qu.: 4.000 3rd Qu.:2007 3rd Qu.:124.7
## Max. :222.000 Max. :2023 Max. :213.7
## swimpool_w1km school_distance type sold_within1week
## Min. :0.00000 Min. :0.0010 Length:1011 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.2035 Class :character 1st Qu.:0.0000
## Median :0.00000 Median :0.4480 Mode :character Median :0.0000
## Mean :0.04649 Mean :0.6094 Mean :0.3858
## 3rd Qu.:0.00000 3rd Qu.:0.7670 3rd Qu.:1.0000
## Max. :1.00000 Max. :5.3490 Max. :1.0000
## selling_price
## Min. : 32.71
## 1st Qu.: 55.29
## Median : 64.97
## Mean : 67.77
## 3rd Qu.: 76.34
## Max. :234.36
head(d)
## n.rooms build.year postalcode square_meters swimpool_w1km school_distance
## 1 3 1938 A2 96.4 0 0.101
## 2 5 2018 A7 132.6 0 2.170
## 3 4 1995 A4 119.7 0 0.372
## 4 1 1900 A1 72.2 0 0.855
## 5 1 1984 A4 56.0 0 1.283
## 6 3 1949 A2 94.6 0 0.103
## type sold_within1week selling_price
## 1 Apartment 0 43.27219
## 2 House 0 91.04185
## 3 Apartment 0 74.90384
## 4 Apartment 1 52.83142
## 5 Apartment 0 41.51371
## 6 Apartment 1 43.20737
plot(d)
ggpairs(d)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
dim(d)
## [1] 1011 9
The str() output shows that the variables sold_within1week and swimpool_w1km are supposed to be factors, but R read them in as integers.
str(d)
## 'data.frame': 1011 obs. of 9 variables:
## $ n.rooms : int 3 5 4 1 1 3 4 3 3 6 ...
## $ build.year : int 1938 2018 1995 1900 1984 1949 1966 2008 1943 1937 ...
## $ postalcode : chr "A2" "A7" "A4" "A1" ...
## $ square_meters : num 96.4 132.6 119.7 72.2 56 ...
## $ swimpool_w1km : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_distance : num 0.101 2.17 0.372 0.855 1.283 ...
## $ type : chr "Apartment" "House" "Apartment" "Apartment" ...
## $ sold_within1week: int 0 0 0 1 0 1 0 0 0 0 ...
## $ selling_price : num 43.3 91 74.9 52.8 41.5 ...
The data set contains 1011 rows (observations) and 9 columns: n.rooms, build.year, postalcode, square_meters, swimpool_w1km, school_distance, type, sold_within1week and selling_price.
The variables n.rooms, build.year, selling_price, school_distance and square_meters are numerical.
The variables postalcode, swimpool_w1km, sold_within1week and type are categorical.
Checking for duplicated rows, I found that 11 rows in this dataset are duplicates.
nrow(d)
## [1] 1011
nrow(unique(d))
## [1] 1000
d_2 <- d[!duplicated(d), ]
dim(d_2)
## [1] 1000 9
I noticed that the variables postalcode, type, swimpool_w1km and sold_within1week are coded with levels (for the binary ones, 1 = yes and 0 = no), but this was not evident in the numerical summary.
Likewise, the variable type takes the values Apartment or House, which was also not evident in the numerical summary.
d_2$postalcode <- as.factor(d_2$postalcode)
d_2$type <- as.factor(d_2$type)
d_2$sold_within1week <- as.factor(d_2$sold_within1week)
summary(d_2)
## n.rooms build.year postalcode square_meters swimpool_w1km
## Min. : 1.00 Min. :1900 A1: 95 Min. : 56.00 Min. :0.000
## 1st Qu.: 2.00 1st Qu.:1961 A2:171 1st Qu.: 98.22 1st Qu.:0.000
## Median : 3.00 Median :1991 A3:153 Median :110.40 Median :0.000
## Mean : 3.25 Mean :1982 A4:207 Mean :112.46 Mean :0.047
## 3rd Qu.: 4.00 3rd Qu.:2007 A5:166 3rd Qu.:124.62 3rd Qu.:0.000
## Max. :222.00 Max. :2023 A6:114 Max. :213.70 Max. :1.000
## A7: 94
## school_distance type sold_within1week selling_price
## Min. :0.0010 Apartment:850 0:614 Min. : 32.71
## 1st Qu.:0.2030 House :150 1:386 1st Qu.: 55.22
## Median :0.4475 Median : 65.00
## Mean :0.6112 Mean : 67.77
## 3rd Qu.:0.7680 3rd Qu.: 76.33
## Max. :5.3490 Max. :234.36
##
model_d_2 <- glm(data = d_2, sold_within1week ~ ., family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_d_2)
##
## Call:
## glm(formula = sold_within1week ~ ., family = "binomial", data = d_2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.168982 9.966027 0.519 0.603998
## n.rooms -0.402676 0.083625 -4.815 1.47e-06 ***
## build.year 0.000292 0.005137 0.057 0.954662
## postalcodeA2 -3.476893 0.455915 -7.626 2.42e-14 ***
## postalcodeA3 -3.898849 0.456763 -8.536 < 2e-16 ***
## postalcodeA4 -3.350284 0.505090 -6.633 3.29e-11 ***
## postalcodeA5 -3.514069 0.537158 -6.542 6.07e-11 ***
## postalcodeA6 -2.940972 0.617407 -4.763 1.90e-06 ***
## postalcodeA7 -3.212262 0.623885 -5.149 2.62e-07 ***
## square_meters -0.004179 0.005444 -0.768 0.442698
## swimpool_w1km 1.189388 0.352865 3.371 0.000750 ***
## school_distance -0.531551 0.159205 -3.339 0.000841 ***
## typeHouse -0.954252 0.258480 -3.692 0.000223 ***
## selling_price -0.016438 0.007482 -2.197 0.028012 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1333.9 on 999 degrees of freedom
## Residual deviance: 1057.9 on 986 degrees of freedom
## AIC: 1085.9
##
## Number of Fisher Scoring iterations: 6
Given the input below, there is a 27.3% chance that an apartment with these characteristics will be sold within one week.
new_data <- data.frame(n.rooms = 3, build.year = 1990, postalcode = "A2", square_meters = 101, swimpool_w1km = 0, school_distance = 1.2, type = "Apartment", selling_price = 60)
predict(model_d_2, newdata = new_data, type = "response")
## 1
## 0.2726939
Yes, the seller would have a higher probability (55.2%) of selling the apartment if there is a swimming pool within 1 km of the building: without a pool the estimated probability was 27.3%, and the presence of a pool increases it to 55.2%.
new_data_1 <- data.frame(n.rooms = 3, build.year = 1990, postalcode = "A2", square_meters = 101, swimpool_w1km = 1, school_distance = 1.2, type = "Apartment", selling_price = 60)
predict(model_d_2, newdata = new_data_1, type = "response")
## 1
## 0.55191
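The jump from 27.3% to 55.2% follows directly from the fitted swimpool_w1km coefficient (1.189388) on the log-odds scale. A small sketch of that arithmetic, using the numbers from the model output above:

```r
# Baseline probability without a pool (from the first predict() call)
p0 <- 0.2726939
# Fitted log-odds coefficient for swimpool_w1km from summary(model_d_2)
beta_pool <- 1.189388

# Adding the pool multiplies the odds by exp(beta_pool), about 3.29
odds0 <- p0 / (1 - p0)
odds1 <- odds0 * exp(beta_pool)
p1 <- odds1 / (1 + odds1)
p1  # ~0.552, matching the second predict() call
```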
I found that the probability of selling the apartment within one week rises to 67% if the seller lowers the asking price to 30 million ISK. Hence, lowering the price increases the chance of selling an apartment within a week.
new_data_2 <- data.frame(n.rooms = 3, build.year = 1990, postalcode = "A2", square_meters = 101, swimpool_w1km = 1, school_distance = 1.2, type = "Apartment", selling_price = 30)
predict(model_d_2, newdata = new_data_2, type = "response")
## 1
## 0.6685264
set.seed(233)
training <- d_2[sample(1:nrow(d_2),size=(0.75*nrow(d_2)),replace=FALSE),]
test <- d_2[-which(row.names(d_2) %in% row.names(training)),]
summary(training)
## n.rooms build.year postalcode square_meters swimpool_w1km
## Min. : 1.000 Min. :1900 A1: 70 Min. : 58.0 Min. :0.00000
## 1st Qu.: 2.000 1st Qu.:1961 A2:127 1st Qu.: 98.0 1st Qu.:0.00000
## Median : 3.000 Median :1990 A3:115 Median :110.3 Median :0.00000
## Mean : 3.316 Mean :1982 A4:159 Mean :112.1 Mean :0.04133
## 3rd Qu.: 4.000 3rd Qu.:2007 A5:129 3rd Qu.:123.4 3rd Qu.:0.00000
## Max. :222.000 Max. :2023 A6: 82 Max. :211.7 Max. :1.00000
## A7: 68
## school_distance type sold_within1week selling_price
## Min. :0.0010 Apartment:637 0:454 Min. : 32.71
## 1st Qu.:0.2003 House :113 1:296 1st Qu.: 55.60
## Median :0.4400 Median : 65.50
## Mean :0.5853 Mean : 67.88
## 3rd Qu.:0.7495 3rd Qu.: 76.69
## Max. :3.3540 Max. :234.36
##
Here I leave out the variable sold_within1week because, at the time the price is set, the property has not yet been sold.
selling_price = β0 + β1 x n.rooms + β2 x postalcode + β3 x square_meters + β4 x swimpool_w1km + β5 x build.year + β6 x school_distance + β7 x type
model1 <- lm(data = training, selling_price ~ n.rooms + postalcode + square_meters + swimpool_w1km + build.year + school_distance + type)
summary(model1)
##
## Call:
## lm(formula = selling_price ~ n.rooms + postalcode + square_meters +
## swimpool_w1km + build.year + school_distance + type, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.924 -8.033 -0.755 6.682 109.386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.69385 61.13884 -0.437 0.66252
## n.rooms 0.06498 0.05654 1.149 0.25076
## postalcodeA2 -18.76925 1.95860 -9.583 < 2e-16 ***
## postalcodeA3 -6.50875 2.09214 -3.111 0.00194 **
## postalcodeA4 -7.69433 2.54359 -3.025 0.00257 **
## postalcodeA5 -7.21099 2.76655 -2.606 0.00933 **
## postalcodeA6 -14.07870 3.34836 -4.205 2.94e-05 ***
## postalcodeA7 -16.37902 3.37853 -4.848 1.52e-06 ***
## square_meters 0.55644 0.02061 26.992 < 2e-16 ***
## swimpool_w1km 0.10479 2.29233 0.046 0.96355
## build.year 0.02075 0.03159 0.657 0.51140
## school_distance 0.13437 0.97437 0.138 0.89035
## typeHouse 5.49084 1.28118 4.286 2.06e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.36 on 737 degrees of freedom
## Multiple R-squared: 0.5632, Adjusted R-squared: 0.5561
## F-statistic: 79.19 on 12 and 737 DF, p-value: < 2.2e-16
I use the plot() function for quick diagnostics of the model.
The residuals-versus-fitted-values plot indicates a problem with model1: there is structure in the residuals, suggesting non-linearity and non-constant variance. The QQ plot also indicates that the residuals of model1 are not normally distributed.
plot(model1)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
ggplot(model1, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0)
training$res <- residuals(model1)
plot(training$n.rooms, training$res,
xlab = "Predictor n.rooms",
ylab = "residuals")
abline(h=0)
plot(training$build.year, training$res,
xlab = "Predictor build.year",
ylab = "residuals")
abline(h=0)
plot(training$postalcode, training$res,
xlab = "Predictor postalcode",
ylab = "residuals")
abline(h=0)
plot(training$square_meters, training$res,
xlab = "Predictor square_meters",
ylab = "residuals")
abline(h=0)
plot(training$swimpool_w1km, training$res,
xlab = "Predictor swimpool_w1km",
ylab = "residuals")
abline(h=0)
plot(training$school_distance, training$res,
xlab = "Predictor school_distance",
ylab = "residuals")
abline(h=0)
plot(training$type, training$res,
xlab = "Predictor type",
ylab = "residuals")
abline(h=0)
qqnorm(model1$res,ylab="Raw Residuals")
qqline(model1$res)
Since the diagnostics of the model1 residuals show the model is not appropriate, I examine whether a Box-Cox transformation of the response variable yields a more suitable model than model1.
As the 95% confidence interval for λ includes zero, the Box-Cox transformation of selling_price will be based on λ = 0, for interpretability purposes: selling_price is transformed to log(selling_price).
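The λ that maximises the profile likelihood can also be read off programmatically rather than from the plot. Since the fitted training data are not reproduced here, this is a self-contained sketch on simulated data with a log-normal response (true λ = 0); with the actual model one would pass model1 to boxcox() instead:

```r
library(MASS)

# Simulated data where a log transform of the response is appropriate
set.seed(1)
x <- runif(200, 1, 10)
y <- exp(0.5 + 0.1 * x + rnorm(200, sd = 0.2))

bc <- boxcox(lm(y ~ x), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat  # close to 0, so log(y) is a reasonable choice
```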
boxcox(model1,plotit=T)
model2 <- lm(data = training, log(selling_price) ~ n.rooms + postalcode + square_meters + swimpool_w1km + build.year + school_distance + type)
summary(model2)
##
## Call:
## lm(formula = log(selling_price) ~ n.rooms + postalcode + square_meters +
## swimpool_w1km + build.year + school_distance + type, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.47119 -0.11095 -0.00113 0.10966 0.52977
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1312046 0.8235767 3.802 0.000155 ***
## n.rooms 0.0007599 0.0007616 0.998 0.318696
## postalcodeA2 -0.2874152 0.0263835 -10.894 < 2e-16 ***
## postalcodeA3 -0.0902821 0.0281824 -3.203 0.001416 **
## postalcodeA4 -0.0944001 0.0342637 -2.755 0.006012 **
## postalcodeA5 -0.0839536 0.0372671 -2.253 0.024568 *
## postalcodeA6 -0.2069251 0.0451044 -4.588 5.27e-06 ***
## postalcodeA7 -0.2285115 0.0455108 -5.021 6.45e-07 ***
## square_meters 0.0074029 0.0002777 26.659 < 2e-16 ***
## swimpool_w1km -0.0017130 0.0308791 -0.055 0.955775
## build.year 0.0001736 0.0004255 0.408 0.683367
## school_distance 0.0101807 0.0131254 0.776 0.438203
## typeHouse 0.0759985 0.0172583 4.404 1.22e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1665 on 737 degrees of freedom
## Multiple R-squared: 0.5728, Adjusted R-squared: 0.5658
## F-statistic: 82.34 on 12 and 737 DF, p-value: < 2.2e-16
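Because the response is now log(selling_price), a coefficient β translates into an approximate (exp(β) − 1) × 100% change in price. For example, the typeHouse estimate of 0.076 implies that houses sell for roughly 7.9% more than apartments, other things equal; a quick check:

```r
beta_house <- 0.0759985  # typeHouse estimate from summary(model2)
pct_change <- (exp(beta_house) - 1) * 100
pct_change  # ~7.9% higher price for a house vs. an apartment
```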
I again examine the residuals-versus-fitted-values plot.
There is some improvement in model2: there is no indication of structure in the residuals or of non-constant variance, and the residuals appear approximately normally distributed in the normal QQ plot. However, one observation has a leverage higher than the rule-of-thumb threshold suggested by Faraway (threshold = 0.032).
I will investigate further with the jackknife residuals.
plot(model2)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
ggplot(model2, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0)
training$res <- residuals(model2)
plot(training$n.rooms, training$res,
xlab = "Predictor n.rooms",
ylab = "residuals")
abline(h=0)
plot(training$build.year, training$res,
xlab = "Predictor build.year",
ylab = "residuals")
abline(h=0)
plot(training$postalcode, training$res,
xlab = "Predictor postalcode",
ylab = "residuals")
abline(h=0)
plot(training$square_meters, training$res,
xlab = "Predictor square_meters",
ylab = "residuals")
abline(h=0)
plot(training$swimpool_w1km, training$res,
xlab = "Predictor swimpool_w1km",
ylab = "residuals")
abline(h=0)
plot(training$school_distance, training$res,
xlab = "Predictor school_distance",
ylab = "residuals")
abline(h=0)
plot(training$type, training$res,
xlab = "Predictor type",
ylab = "residuals")
abline(h=0)
qqnorm(model2$res,ylab="Raw Residuals")
qqline(model2$res)
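The leverage threshold mentioned above follows Faraway's rule of thumb 2p/n. Since the fitted training data are not reproduced here, this self-contained sketch shows the check on simulated data; with the actual model one would use hatvalues(model2) and nrow(training) instead:

```r
# Simulated stand-in for the regression
set.seed(42)
sim <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
sim$y <- 1 + sim$x1 - sim$x2 + rnorm(100)
fit <- lm(y ~ x1 + x2, data = sim)

h <- hatvalues(fit)
p <- length(coef(fit))          # number of parameters
threshold <- 2 * p / nrow(sim)  # Faraway's rule of thumb
which(h > threshold)            # indices of high-leverage points
```

As a sanity check, the leverages always sum to the number of parameters p (the trace of the hat matrix).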
Looking at the jackknife residuals, the observation with the largest absolute residual has value -9.338425, and the one with the smallest has value -0.0008467366.
The Bonferroni-corrected significance threshold is ±4.013996.
jack_training <- rstudent(model2)
plot(jack_training,ylab="Jacknife Residuals of training",main="Jacknife Residuals")
abline(h=0)
jack_training[abs(jack_training)==max(abs(jack_training))]
## 104
## -9.338425
jack_training[abs(jack_training)==min(abs(jack_training))]
## 894
## -0.0008467366
range(rstudent(model2))
## [1] -9.338425 3.284627
qt(0.05/(750*2), 649)
## [1] -4.013996
training$jack <- rstudent(model2)
threshold=qt(0.05/(nrow(training)*2),lower.tail=FALSE,df=(nrow(training)-length(model2$coefficients)-1))
I now turn my attention to Cook's distance to evaluate the model. If an observation's Cook's distance is greater than 4/n, I treat that observation as an outlier.
t = 4/nrow(training)
cooks_distance <- cooks.distance(model1)
plot(cooks_distance)
outliers <- which(cooks_distance>t)
print(outliers)
## 780 907 595 468 310 662 15 376 322 814 416 793 165 429 104 461 499 452 597 471
## 25 64 101 106 131 141 167 193 214 230 239 348 356 358 429 446 448 466 503 523
## 684 148 835 798 792 109
## 524 559 566 589 599 734
training_clean <- training[-c(outliers),]
cat("\nFull dataset records: ",nrow(training))
##
## Full dataset records: 750
cat("\nFiltered dataset records: ", nrow(training_clean))
##
## Filtered dataset records: 724
selling_price = β0 + β1 x n.rooms + β2 x postalcode + β3 x square_meters + β4 x swimpool_w1km + β5 x build.year + β6 x school_distance + β7 x type
model3 <- lm(data = training_clean , selling_price ~ n.rooms + postalcode + square_meters + swimpool_w1km + build.year + school_distance + type)
summary(model3)
##
## Call:
## lm(formula = selling_price ~ n.rooms + postalcode + square_meters +
## swimpool_w1km + build.year + school_distance + type, data = training_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.427 -6.790 -0.683 6.726 28.835
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22.16626 50.14171 -0.442 0.658571
## n.rooms 2.92721 0.35627 8.216 9.92e-16 ***
## postalcodeA2 -18.98089 1.60518 -11.825 < 2e-16 ***
## postalcodeA3 -7.98700 1.72137 -4.640 4.15e-06 ***
## postalcodeA4 -8.12618 2.08430 -3.899 0.000106 ***
## postalcodeA5 -6.51359 2.26551 -2.875 0.004160 **
## postalcodeA6 -16.18358 2.74553 -5.895 5.80e-09 ***
## postalcodeA7 -17.56467 2.76445 -6.354 3.75e-10 ***
## square_meters 0.35690 0.02415 14.775 < 2e-16 ***
## swimpool_w1km -1.84791 1.93413 -0.955 0.339690
## build.year 0.02524 0.02587 0.976 0.329459
## school_distance 0.91964 0.79663 1.154 0.248715
## typeHouse 4.60351 1.08543 4.241 2.52e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.891 on 711 degrees of freedom
## Multiple R-squared: 0.6159, Adjusted R-squared: 0.6094
## F-statistic: 95 on 12 and 711 DF, p-value: < 2.2e-16
Again, there is some improvement in model3: there is no indication of structure in the residuals, and the residuals appear approximately normally distributed in the normal QQ plot.
plot(model3)
ggplot(model3, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0)
From these plots I see much improvement across all variables, with nothing suspicious compared to the first two models.
training_clean$res <- residuals(model3)
plot(training_clean$n.rooms, training_clean$res,
xlab = "Predictor n.rooms",
ylab = "residuals")
abline(h=0)
plot(training_clean$build.year, training_clean$res,
xlab = "Predictor build.year",
ylab = "residuals")
abline(h=0)
plot(training_clean$postalcode, training_clean$res,
xlab = "Predictor postalcode",
ylab = "residuals")
abline(h=0)
plot(training_clean$square_meters, training_clean$res,
xlab = "Predictor square_meters",
ylab = "residuals")
abline(h=0)
plot(training_clean$swimpool_w1km, training_clean$res,
xlab = "Predictor swimpool_w1km",
ylab = "residuals")
abline(h=0)
plot(training_clean$school_distance, training_clean$res,
xlab = "Predictor school_distance",
ylab = "residuals")
abline(h=0)
plot(training_clean$type, training_clean$res,
xlab = "Predictor type",
ylab = "residuals")
abline(h=0)
Also the residuals seem to be approximately normally distributed when examining the normal qq plot
qqnorm(model3$res,ylab="Raw Residuals")
qqline(model3$res)
I now reduce the model by removing predictors whose p-values are greater than 0.05. In this case, the variables swimpool_w1km, build.year and school_distance were dropped from the model.
Despite having a slightly lower R², model4 is practically equivalent to model3, so I prefer model4 as my final model, since all of its variables are significant at the 0.05 level.
selling_price = 27.11485 + 2.91387 x n.rooms - 18.53843 x postalcodeA2 - 7.35384 x postalcodeA3 - 6.78113 x postalcodeA4 - 5.10811 x postalcodeA5 - 13.48789 x postalcodeA6 - 14.97878 x postalcodeA7 + 0.35708 x square_meters + 4.71228 x typeHouse
The estimated residual standard error in my final model is 9.892.
model4 <- lm(data = training_clean , selling_price ~ n.rooms + postalcode + square_meters + type)
summary(model4)
##
## Call:
## lm(formula = selling_price ~ n.rooms + postalcode + square_meters +
## type, data = training_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.6147 -6.8596 -0.7441 6.7416 29.0273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.11485 2.38072 11.389 < 2e-16 ***
## n.rooms 2.91387 0.35533 8.201 1.11e-15 ***
## postalcodeA2 -18.53843 1.51180 -12.262 < 2e-16 ***
## postalcodeA3 -7.35384 1.54641 -4.755 2.40e-06 ***
## postalcodeA4 -6.78113 1.45591 -4.658 3.81e-06 ***
## postalcodeA5 -5.10811 1.50236 -3.400 0.000711 ***
## postalcodeA6 -13.48789 1.65017 -8.174 1.36e-15 ***
## postalcodeA7 -14.97878 1.72248 -8.696 < 2e-16 ***
## square_meters 0.35708 0.02409 14.823 < 2e-16 ***
## typeHouse 4.71228 1.08323 4.350 1.56e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.892 on 714 degrees of freedom
## Multiple R-squared: 0.6142, Adjusted R-squared: 0.6093
## F-statistic: 126.3 on 9 and 714 DF, p-value: < 2.2e-16
summary(model4)$r.squared
## [1] 0.6141888
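As a sanity check on the fitted equation, the coefficients above can be plugged in by hand for a hypothetical property (a 3-room, 101 m² apartment in postal code A2); the same result should come out of predict(model4, ...) on that input:

```r
# Hand prediction from the model4 coefficients for a hypothetical
# 3-room, 101 m^2 apartment in postal code A2
price_hat <- 27.11485 +
  2.91387 * 3 +       # n.rooms
  (-18.53843) +       # postalcodeA2 dummy
  0.35708 * 101       # square_meters
# type = Apartment is the reference level, so typeHouse contributes 0
price_hat  # ~53.4 million ISK
```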
Model4 fits well: there is no indication of structure in the residual plot, and the residuals appear approximately normally distributed in the normal QQ plot.
plot(model4)
ggplot(model4, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0)
qqnorm(model4$res,ylab="Raw Residuals")
qqline(model4$res)
The diagram shows a positive relationship between the observed selling prices and the predicted selling prices in the test data.
model.test <- lm(data = test , selling_price ~ n.rooms + postalcode + square_meters + type)
plot_data <- data.frame(Predicted_value = predict(model.test),
Observed_value = test$selling_price)
ggplot(plot_data, aes(x = Predicted_value, y = Observed_value)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color = "green")
The root mean square error for the test data is about 10.4 million ISK.
model.test <- lm(data = test , selling_price ~ n.rooms + postalcode + square_meters + type)
pred_selling_price <- predict(model.test, test)
test %>% summarize(RMSE(pred_selling_price, selling_price))
## RMSE(pred_selling_price, selling_price)
## 1 10.35646
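RMSE() here is the helper from caret; it computes sqrt(mean((pred - obs)^2)). A minimal sketch of the same quantity by hand, on toy numbers:

```r
# RMSE by hand on toy numbers; caret::RMSE(pred, obs) gives the same value
pred <- c(60, 70, 80)
obs  <- c(62, 69, 84)
rmse <- sqrt(mean((pred - obs)^2))
rmse  # sqrt(7), about 2.65
```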
For the model with all predictors, the root mean square error on the test data is about 10.3 million ISK.
model.test2 <- lm(data = test , selling_price ~ n.rooms + postalcode + square_meters + swimpool_w1km + build.year + school_distance + type)
pred_selling_price.m.t.2 <- predict(model.test2, test)
test %>% summarize(RMSE(pred_selling_price.m.t.2, selling_price))
## RMSE(pred_selling_price.m.t.2, selling_price)
## 1 10.2935